Skip to content

Conversation

@enryH
Copy link
Collaborator

@enryH enryH commented Sep 16, 2025

Description

Add a subpackage (as new model) to core.

Tasks Checklist

  • The folder names defines the name of the subpackage (module)
  • Add the user-facing functions to the __init__.py in the new folder, so that
    they are available when the subpackage is imported.
  • Used pydantic since only dictionary outputs in my learning example...
    Create Pandera schema in a file with subpackage name in the src/acore/types folder.
    Optimal is to have only one output schema of results per subpackage or module.
  • Add a relatively small and public dataset to the data folder, or reuse an existing one for testing
  • Create an api example jupyter notebook in the docs/api_examples folder with that data
  • Use jupytext to sync the Jupyter notebook with a Python script
  • Update index.rst file in the docs folder with the new example
  • Create test script in the /test folder with the name of the subpackage or module
    using pytest or unittests to test your new functionality.

@angelphanth angelphanth marked this pull request as ready for review October 30, 2025 13:27
Copy link
Collaborator Author

@enryH enryH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will test it now.

@@ -0,0 +1,53 @@
# preprocessing
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this to something similar to compositional.py - I would like to not have it specific to microbiome at best, but we can group it here for now. Maybe you could move the file as compositional.py to transform/compositional.py?

@@ -0,0 +1,25 @@
eff_id,inf_id,sampling_location,sampling_read,eff_abundance,inf_abundance
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a README.md to the folder briefly putting a link where this originates from? And maybe if it is a subset, a small preprocessing script?

@enryH enryH requested a review from Copilot November 7, 2025 08:42
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive permutation testing functionality to the acore library, including implementations for paired and independent sample tests, chi-squared tests, and Jaccard similarity calculations. The changes also include documentation, examples, and minor code quality improvements.

  • Implements three main permutation test functions: paired_permutation, indep_permutation, and chi2_permutation
  • Adds Jaccard similarity calculations for graph comparison
  • Includes comprehensive test suite and tutorial documentation with real-world metagenomics example

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
src/acore/permutation_test/init.py Core permutation test implementations with support for multiple metrics
src/acore/permutation_test/internal_functions.py Helper functions for permutation and contingency table operations
src/acore/permutation_test/jaccard.py Jaccard similarity calculation functions for set comparison
src/acore/types/permutation_test.py Pydantic model for permutation test results
tests/test_permutation_test.py Comprehensive test suite for permutation test functions
docs/api_examples/permutation_testing.py Tutorial demonstrating permutation testing on metagenomics data
docs/api_examples/permutation_testing.ipynb Jupyter notebook version of the tutorial
docs/index.rst Updated documentation index to include permutation testing tutorial
src/acore/microbiome/internal_functions.py New CLR transformation functions for compositional data
example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv Example dataset for tutorial
src/acore/multiple_testing/init.py Code style fix: changed membership test to use not in operator
src/acore/enrichment_analysis/init.py Code style fix: changed membership tests to use not in operator
src/acore/decomposition/umap.py Added blank line for formatting consistency
src/acore/io/uniprot/uniprot.py Removed extra blank lines
CONTRIBUTING.rst Fixed installation command quoting and corrected PR template path

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +50 to +51
print(result)
print(ttest_rel(cond1, cond2))
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug print statements should be removed from the test code. These print statements (lines 50-51) will clutter the test output.

Suggested change
print(result)
print(ttest_rel(cond1, cond2))

Copilot uses AI. Check for mistakes.
#
# In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).
#
# The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors in this line: "wastewaster" should be "wastewater" and "inffluent" should be "influent".

Suggested change
# The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).
# The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).

Copilot uses AI. Check for mistakes.
# ## Data preparation details
#
# ### Downloading
# The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing.
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".

Copilot uses AI. Check for mistakes.
#
# The permutation test compares the actual observed chosen metric (e.g., t-statistic, mean difference) with metrics calculated when the dataset values are randomly shuffled permutations of the dataset.
#
# If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance.
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "sizze" should be "size".

Suggested change
# If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance.
# If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance.

Copilot uses AI. Check for mistakes.
"## Data preparation details\n",
"\n",
"### Downloading\n",
"The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing.\n",
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".

Copilot uses AI. Check for mistakes.
Comment on lines +23 to +33
def random_choice_equal():
return np.random.choice(
["A", "B", "C", "D"], size=100, replace=True, p=[0.25, 0.25, 0.25, 0.25]
)


@pytest.fixture
def random_choice_unequal():
return np.random.choice(
["A", "B", "C", "D"], size=100, replace=True, p=[0.1, 0.25, 0.05, 0.6]
)
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test fixtures that use random number generation without a fixed seed can lead to non-reproducible test failures. Consider using np.random.default_rng(seed=...) or np.random.seed() to ensure reproducibility.

Copilot uses AI. Check for mistakes.
# %%
from acore.permutation_test import paired_permutation

# optional choice of random number generatorfor repro
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space in comment: "generatorfor" should be "generator for".

Suggested change
# optional choice of random number generatorfor repro
# optional choice of random number generator for repro

Copilot uses AI. Check for mistakes.
"\n",
"In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).\n",
"\n",
"The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).\n",
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors in this line: "wastewaster" should be "wastewater" and "inffluent" should be "influent".

Suggested change
"The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).\n",
"The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).\n",

Copilot uses AI. Check for mistakes.
"source": [
"from acore.permutation_test import paired_permutation\n",
"\n",
"# optional choice of random number generatorfor repro\n",
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space in comment: "generatorfor" should be "generator for".

Suggested change
"# optional choice of random number generatorfor repro\n",
"# optional choice of random number generator for repro\n",

Copilot uses AI. Check for mistakes.
from acore import permutation_test as pt
import pytest
import numpy as np
from scipy.stats import ttest_rel, ttest_ind, mannwhitneyu, wilcoxon
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'ttest_ind' is not used.
Import of 'mannwhitneyu' is not used.
Import of 'wilcoxon' is not used.

Suggested change
from scipy.stats import ttest_rel, ttest_ind, mannwhitneyu, wilcoxon
from scipy.stats import ttest_rel

Copilot uses AI. Check for mistakes.
@enryH
Copy link
Collaborator Author

enryH commented Nov 7, 2025

@angelphanth Definitely 'squash and merge` this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants